Main Features of Interest
I’m curious of the correlation between income related variables, loan status, and credit score.
Prosper, a peer-to-peer lending company, kept a dataset that contained 113,937 loans with 81 variables per loan. The database can be downloaded from this url: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv
I chose this dataset to help increase my own personal financial literacy and health. Mortgages can make or break someone’s adult life; and either leave them happy and care free or crippled under debt. Through this exploration I hope to get a better understanding of what variables seem to influence the interest rate and the loan amount.
Since there’s no variable for loan amount approved, I’ll mainly focus on variables that influence borrower rates.
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## ListingKey ListingNumber
## 0 0
## ListingCreationDate CreditGrade
## 0 0
## Term LoanStatus
## 0 0
## ClosedDate BorrowerAPR
## 0 25
## BorrowerRate LenderYield
## 0 0
## EstimatedEffectiveYield EstimatedLoss
## 29084 29084
## EstimatedReturn ProsperRating..numeric.
## 29084 29084
## ProsperRating..Alpha. ProsperScore
## 0 29084
## ListingCategory..numeric. BorrowerState
## 0 0
## Occupation EmploymentStatus
## 0 0
## EmploymentStatusDuration IsBorrowerHomeowner
## 7625 0
## CurrentlyInGroup GroupKey
## 0 0
## DateCreditPulled CreditScoreRangeLower
## 0 591
## CreditScoreRangeUpper FirstRecordedCreditLine
## 591 0
## CurrentCreditLines OpenCreditLines
## 7604 7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## 697 0
## OpenRevolvingMonthlyPayment InquiriesLast6Months
## 0 697
## TotalInquiries CurrentDelinquencies
## 1159 697
## AmountDelinquent DelinquenciesLast7Years
## 7622 990
## PublicRecordsLast10Years PublicRecordsLast12Months
## 697 7604
## RevolvingCreditBalance BankcardUtilization
## 7604 7604
## AvailableBankcardCredit TotalTrades
## 7544 7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## 7544 7544
## DebtToIncomeRatio IncomeRange
## 8554 0
## IncomeVerifiable StatedMonthlyIncome
## 0 0
## LoanKey TotalProsperLoans
## 0 91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## 91852 91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 91852 91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 91852 91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 95009 0
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination
## 96985 0
## LoanNumber LoanOriginalAmount
## 0 0
## LoanOriginationDate LoanOriginationQuarter
## 0 0
## MemberKey MonthlyLoanPayment
## 0 0
## LP_CustomerPayments LP_CustomerPrincipalPayments
## 0 0
## LP_InterestandFees LP_ServiceFees
## 0 0
## LP_CollectionFees LP_GrossPrincipalLoss
## 0 0
## LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments
## 0 0
## PercentFunded Recommendations
## 0 0
## InvestmentFromFriendsCount InvestmentFromFriendsAmount
## 0 0
## Investors
## 0
Through the use of the is.na function, we can determine the number of NAs in each variable. This helps me filter our variables with too many NA’s.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003
Through the summary function on the variable: Stated Monthly Income; we learned that the median income is $4667 with an IQR of $3625. That means that outliers would be found above $12,265.5 and below $2,237.5. I’ve confingured the plot to only show stated monthly income’s between $0 and $12,266. You can see that the data is skewed right with more borrowers between the 1st quartile $3,200 and the 3rd quartile $6816.
## List of 1
## $ axis.text.x:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 1
## ..$ vjust : NULL
## ..$ angle : num 90
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
## Employed Full-time Part-time Self-employed Retired
## 67316 26353 1088 6128 795
## Not employed Not available Other
## 835 5344 3806 2255
It’s hard to say what Employed means in this case. Looking at Full-Time, Part-time, and Self-Employed, which are typical employment situations; the numbers don’t seem to add up to the number of borrowers employed.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7622
Here, we go a bit deeper into Employment looking specifically at status of how long someone has been employed in months. Running the summary function on employment status duration reveals that the median length of employment in 67 months. The max of 755 is skewing the data right and possibly conflicting the graph. Finding out outliers are above 303.5, I’ve limited the x-axis to better visualize the data. Is the employment status, Employed, a situational matter having to do with a certain length of months employed? This may need further scrutiny.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
It’s interesting to note that there are several spikes on certain locations of the data. I changed the X-axis to better determine why that might be. At first, I believed they may be at $5000 intervals, it seems the first 3 spikes are at $3500, $7000, and $10,500. I’m not sure why this may be, this needs further scrutiny. Also, it seems at least 75% of the loans are under $12,000.
## AA A B C D E HR NC
## 3507 3314 4389 5648 5153 3288 3508 141 84972
Several thousand loans, don’t have a credit grade assigned.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 669.5 689.5 695.1 729.5 889.5 591
All three histograms look fairly similar. The first two hisograms represent the lower and upper range of credit scores, while the third represents the average between the lower and upper range. Most average credit scores seem to be in between 600 and 800.
## # A tibble: 2,294 x 2
## BorrowerRate Count
## <dbl> <int>
## 1 0.318 3671
## 2 0.350 1905
## 3 0.320 1648
## 4 0.290 1508
## 5 0.270 1319
## 6 0.150 1182
## 7 0.140 1035
## 8 0.110 949
## 9 0.200 907
## 10 0.158 806
## # ... with 2,284 more rows
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The graph is almost right skewed (if it didn’t have the peak around .36). The interest rates don’t show units though, and most likely the interest rates aren’t between 5% to 35%, so we’d have to assume they are between 0.5% to 3.5%.
##
## (-0.000498,0.0622] (0.0622,0.124] (0.124,0.187]
## 861 22753 34820
## (0.187,0.249] (0.249,0.311] (0.311,0.373]
## 24966 19781 10730
## (0.373,0.435] (0.435,0.498]
## 3 6
Here cutting the variable, turns the rates into categorical variables and then can determine the counts of the rates between each category. Although the highest count borrower rate is .3177, regardless of the differing level of breaks, there are other rates, which more counts seem to fall under.
## Completed FinalPaymentInProgress Current
## 38066 205 56569
## PastDue Defaulted Chargedoff
## 2067 5018 11990
## Cancelled
## 5
Most loans to seem to be current or completed. Charged off is also notable here. It’s curious that 5 loans have been canceled.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2251.5
After accounting for outliers, most of the monthly loan payments seem to be lower than approximately $371.6.
The original dataset contained 113,937 observations of 81 variables. Some of these variables include:
- loan amount - most are under $12,000,
- credit grade of the loan (mostly missing),
- loan status (mainly completed, current, or defaulted),
- employment status (high number of employed numbers, but unable to gather how the number was reached),
- employment status duration (median 67 months),
- borrower credit score (mainly between 600 and 800),
- borrower interest rate (mainly .3177, but categorically a higher number are between 0.124 - 0.187), and
- monthly loan payments are lower than $371.6
I’m curious of the correlation between income related variables, loan status, and credit score.
Everyone is different (interest rates, monthly income, monthly loan payments, employment status, occupation, and credit score). Depending on how these stack up, these variables can affect loan repayment and credit score. For example, people with lower monthly incomes are possibly correlated to low or no credit.
I created two new variables:
BorrowerRate.bucket(using the cut function to turn borrower rate into a categorical variable) CreditScoreAverage (a function of CreditScoreRangeLower and CreditScoreRangeUpper).
The StatedMonthlyIncome distribution was right-skewed with a median of $4667. There seemed to be multiple outliers, especially since the max monthly income was $1,750,003. Credit score average and borrower rate were normally distributed with a median of 689.5 and 0.184 respectively.
The first thing I’ll do is get a quick high-level overview of which interesting numeric values are correlated.
Using the above functions, we are able to determine 4 quantitative variables that have strong correlations with each other (positively or negatively):
- BorrowerRate and EstimatedReturn (0.83)
- ProsperScore and BorrowerRate (-0.66)
- OpenCreditLines and TotalCreditLinespast7years (0.57)
- CreditScoreAverage and BorrowerRate (-0.53)
##
## Call:
## lm(formula = log(filter_monthlyIncome$StatedMonthlyIncome) ~
## filter_monthlyIncome$CreditScoreAverage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1383 -0.3443 0.0184 0.3700 3.5094
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 6.883e+00 1.908e-02 360.77
## filter_monthlyIncome$CreditScoreAverage 2.255e-03 2.733e-05 82.53
## Pr(>|t|)
## (Intercept) <2e-16 ***
## filter_monthlyIncome$CreditScoreAverage <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6054 on 111633 degrees of freedom
## (578 observations deleted due to missingness)
## Multiple R-squared: 0.05751, Adjusted R-squared: 0.0575
## F-statistic: 6811 on 1 and 111633 DF, p-value: < 2.2e-16
##
## Pearson's product-moment correlation
##
## data: log(filter_monthlyIncome$StatedMonthlyIncome) and filter_monthlyIncome$CreditScoreAverage
## t = 82.532, df = 111630, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2342713 0.2453288
## sample estimates:
## cor
## 0.2398078
After transforming the data, the graph is more linearlized and shows a bit more correlation.
##
## Call:
## lm(formula = LoanOriginalAmount ~ StatedMonthlyIncome, data = subset(loanData,
## StatedMonthlyIncome > 10))
##
## Residuals:
## Min 1Q Median 3Q Max
## -57144 -4171 -1685 3179 25096
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.143e+03 2.953e+01 174.1 <2e-16 ***
## StatedMonthlyIncome 5.713e-01 4.242e-03 134.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5796 on 112211 degrees of freedom
## Multiple R-squared: 0.1392, Adjusted R-squared: 0.1392
## F-statistic: 1.814e+04 on 1 and 112211 DF, p-value: < 2.2e-16
##
## Pearson's product-moment correlation
##
## data: tmp$LoanOriginalAmount and log(tmp$StatedMonthlyIncome)
## t = 155.7, df = 112210, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4166707 0.4262938
## sample estimates:
## cor
## 0.4214941
The log of monthly income and loan amount show a linear relationship. Their seems to be horizontal lines at or near intervals of $5,000 up to approximately $25,000. This may show when loan amounts are rounded, as this makes sense as most people tend to think in terms of round numbers (this isn’t how much the borrower is approved for, just how much they asked for).
##
## Pearson's product-moment correlation
##
## data: tmp$BorrowerRate and tmp$StatedMonthlyIncome
## t = -66.796, df = 113280, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2002596 -0.1890543
## sample estimates:
## cor
## -0.1946633
##
## Pearson's product-moment correlation
##
## data: tmp$BorrowerRate and tmp$CreditScoreAverage
## t = -174.47, df = 112700, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4657252 -0.4565316
## sample estimates:
## cor
## -0.4611407
##
## Pearson's product-moment correlation
##
## data: tmp$BorrowerRate and tmp$ProsperScore
## t = -248, df = 84336, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6532948 -0.6454893
## sample estimates:
## cor
## -0.6494092
##
## Pearson's product-moment correlation
##
## data: tmp$BorrowerRate and tmp$EstimatedReturn
## t = 412.46, df = 84336, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8154121 0.8198859
## sample estimates:
## cor
## 0.8176613
Above you see four graphs. Plot 1: Looks at Stated Monthly Income compared to Borrower Rate. There seems to be some negative correlation of -0.19. Plot 2: Looks at Average Credit Score compared to Borrower Rate. There is a negative correlation of -0.46 (as credit score increases, there is a correlation that borrower rate decreases). Plot 3: Looks at Prosper Score (NA values are ignored) compared to Borrower Rate. There is a strong negative correlation of -0.65 (as Prosper score increases, borrower rates decrease). Plot 4: Looks at Estimated Return and how much does Borrower rate affect it. There is a strong positive correlation of .82. Estimated return could be a function for borrower rate. This needs more scrutiny.
Here, median credit scores seem to illustrate that lower credit scores have a higher chance of defaulting or being charged off.
It seems that after a prosper score of 10, credit score decreases. People with a good credit score, tend to have a prosper score of 10.
The log of monthly income and loan amount show a linear relationship. Their seems to be horizontal lines at or near intervals of $5,000 up to approximately $25,000. Their is a correlation found using the log of monthly income at 0.42. From this data, we can gather that for those who have a monthly income of less than $8000, their loan amounts tend not to be greater than $25,000.
Visually, there seemed to be little or no correlation between stated monthly income and borrower rate. By running a correlation test, there seems to be a slight correlation of -0.19.
Average Credit Score, Prosper Score, and Estimated Return had correlation scores of the following: -0.46, -0.65, and 0.82. Judging from the higher negative correlations we can associate that borrower rate decreases as average credit score and prosper score increase. On the other hand, estimated return has a strong positive correlation with borrower rate, meaning as borrower rate increases, estimated return increases.
Lower credit scores seems to relate to having a higher chance of defaulting or charging off.
Up until approximately a prosper score of 10 their seems to be a possible relationship between prosper score and credit score.
The strongest relationship I found was between BorrowerRate and EstimatedReturn. Also there was one between ProsperScore and BorrowerRate.
There seem to be many completed loans where the original loan amount is fairly low. As the original loan amount increases and stated monthly income is lower than $10,000 you can see that Defaulted and Charged off loans are occuring more frequently.
Here, we color the data for Employment Status. For the most part, the data is what is expected: Retired, not employed, not available have less than $3000, with most near $0.
I’ve gone ahead and plotted out two versions of the map, one with the added employment status of employed and the other without. With the added employment status of employed, with a stated monthly income of approximately $12000 or less, the loan original amount seems to be below $20000. There are added noticable horizontal lines for both graphs, most likely because of rounding.
Without the employed employment status, the picture is a bit clearer. Most part time incomes under $2000 seem to ask for less than $10000 on their loans. Self-Employed incomes on the other hand seem to ask in intervals of approximately $4000, $10000, and $15000.
These plots allow us to see how Prosper has collected data from 2006 to 2014. It seems employment status data wasn’t kept in 2006. In 2007, we don’t see the ‘Employed’ or ‘Other’ status data points. In 2009, there’s very few loans. Then in 2011, we see that the plots begin to increase desnity again.
Over time, we see an increase in the minimum credit score average. It seems that most loans in 2006 were under $10000. This seems to continue until around 2010, when you begin see higher loan amounts around $20000. It isn’t until 2013, we start seeing loan amounts reaching $30,000.
After 2010, the number of higher income individuals that Prosper has made money from increased.
##
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperScore, data = loanData)
## m2: lm(formula = BorrowerRate ~ ProsperScore + CreditScoreAverage,
## data = loanData)
## m3: lm(formula = BorrowerRate ~ ProsperScore + CreditScoreAverage +
## ListYear, data = loanData)
##
## =======================================================================
## m1 m2 m3
## -----------------------------------------------------------------------
## (Intercept) 0.317*** 0.645*** 0.692***
## (0.001) (0.003) (0.002)
## ProsperScore -0.020*** -0.017*** -0.019***
## (0.000) (0.000) (0.000)
## CreditScoreAverage -0.000*** -0.000***
## (0.000) (0.000)
## ListYear: 2010/2009 0.010***
## (0.001)
## ListYear: 2011/2009 -0.000
## (0.001)
## ListYear: 2012/2009 -0.006***
## (0.001)
## ListYear: 2013/2009 -0.058***
## (0.001)
## ListYear: 2014/2009 -0.080***
## (0.001)
## -----------------------------------------------------------------------
## R-squared 0.422 0.506 0.678
## adj. R-squared 0.422 0.506 0.678
## sigma 0.057 0.052 0.042
## F 61977.291 43391.684 25570.970
## p 0.000 0.000 0.000
## Log-likelihood 123062.508 129685.214 147929.100
## Deviance 273.046 233.579 151.934
## AIC -246119.017 -259362.427 -295840.199
## BIC -246090.971 -259325.033 -295756.063
## N 84841 84841 84841
## =======================================================================
The model above helps to explain the variance in borrower rate. It seems that the variables: Prosper score, credit score average, and listing year are the variables, which seem to make up the model. Other variables were tested, but didn’t seem to have a significant effect.
We saw a positive relationship between loan amount and year of the loan. Other than this, we didn’t discover any new relationships, but did learn about the nuances of previously discovered relationships. For example:
- Many ‘Completed’ loans where LoanOriginalAmount is lower.
- Defaulted and ChargedOff loans seem to occur when the StatedMonthlyIncome is low and the LoanOriginalAmount is high.
- Most part time incomes under $2000 seem to ask for less than $10000 on their loans.
- Self-employed people seemed to mainly ask for $4000, $10,000 or $15,000.
- Employment status data wasn’t kept in 2006. In 2007, we don’t see the ‘Employed’ or ‘Other’ status data points.
- In 2009, there’s very few loans. Then in 2011, we see that the plots begin to increase desnity again.
- We see very few loans in 2009, after the 2008 financial crisis. 2010 picks back up at a bit. In 2011 the graph is dense again, with many self-employed loans.
- Over time, we see an increase of the credit score average. It seems that most loans in 2006 were under $10000. This seems to continue until around 2010, when you begin see higher loan amounts around $20000. It isn’t until 2013, we start seeing loan amounts reaching $30,000.
- After 2010 Prosper made more money from higher income individuals.
A model was made to explain the borrower rate variance. Prosper Score, Credit Score and Listing Year were used and explained 68% of the variance. Other variables were used, but they either had no effect or a negative effect.
This plot gives a bird’s eye view of the correlation between multiple quantitative variables. Here, we see that these varibles are strongly correlated:
- BorrowerRate and EstimatedReturn (0.83)
- ProsperScore and BorrowerRate (-0.66)
- OpenCreditLines and TotalCreditLinespast7years (0.57)
- CreditScoreAverage and BorrowerRate (-0.53)
I’ve gone ahead and plotted out two versions of the map, one with the added employment status of employed and the other without. With the added employment status of employed, with a stated monthly income of approximately $12000 or less, the loan original amount seems to be below $20000. There are added noticable horizontal lines for both graphs, most likely because of rounding.
Without the employed employment status, the picture is a bit clearer. Most part time incomes under $2000 seem to ask for less than $10000 on their loans. Self-Employed incomes on the other hand seem to ask in intervals of approximately $4000, $10000, and $15000.
Over time, we see an increase in the minimum credit score average. It seems that most loans in 2006 were under $10000. This seems to continue until around 2010, when you begin see higher loan amounts around $20000. It isn’t until 2013, we start seeing loan amounts reaching $30,000.
I began this exploration with the Prosper Data set to better determine what variables seem to have the most leverage on lowering interest rate, while achieving a higher loan amount. Since the dataset didn’t have a loan amount approved variable, the efforts were focused towards borrower rate instead. It seems that credit score plays the biggest factor in achieving a lower interest rate. Interest rates seem to rise and fall through the years, and is hard to predict through this dataset alone.
The univariate analysis allowed me to determine what variables would be most suitable for my analysis. Afterwards, the bivariate analysis allows me to see correlations between the data variable pairs, allowing to better determine relationships (if any). Some interesting relationships to note: Higher prosper scores tend to lead to lower borrower rate. Reasonably, high credit scores tend to lead to lower borrower rate’s as well.
It’s hard determine what data variables really lead towards the right direction, and part of it determines on how we as the analyst tend to interpret the data and the variables as well. Often, I went back and forth between what variables make sense to illustrate versus which ones aren’t needed.
If there was more time spent on the dataset, I’d like to see if country location increases or decreases borrower rate and if number of dependents increases or decreases the percentage of completing loans. I’d also like look more deeply into credit lines, inquiries, and debt-to-income ratios overall.